Abstract

We discuss an adaptive flow control mechanism for the
Swift/RAID distributed file system. Our goal is to achieve
near-optimal performance on heterogeneous networks where available
load capacity varies due to other network traffic. The original
Swift/RAID prototype used synchronous communication, achieving
throughput considerably less than available network capacity.
We designed and implemented an adaptive flow control mechanism
that provides greatly improved performance.

Our design uses a simple automatic repeat request (ARQ) Go
Back N protocol coupled with the congestion avoidance and control
mechanism developed for the Transmission Control Protocol
(TCP). The Swift/RAID implementation contains a transfer plan
executor that isolates all of the communications code from the
rest of Swift. The adaptive flow control design was implemented
entirely in this module.

Experimental results show that the adaptive design increases
read throughput from 671 KB/s for the original synchronous
implementation to 927 KB/s for the adaptive prototype (a 38%
increase), and write throughput from 375 KB/s to 559 KB/s
(a 49% increase).


1  Introduction 

Multimedia and scientific visualization require huge files
containing images or digitized sounds. Current systems can
offer only a fraction of the data rates required by these
applications. The Swift distributed file system architecture [4]
was introduced to address this problem. Here we consider
how to use the system efficiently to maximize network
throughput.

Because of network flow problems, synchronous operation
of Swift/RAID was thought to be necessary. However, this
resulted in diminished throughput due to the large waiting
times. To achieve higher throughput, an adaptive flow
control scheme has been designed and implemented. The
design is based on an ARQ Go Back N protocol and includes
the adaptive congestion avoidance and control techniques
used for the Transmission Control Protocol (TCP) [7]. This
design has allowed asynchronous operation of the prototype.
Compared with synchronous operation, RAID level 4 throughput
increased from 671 KB/s to 927 KB/s for reads and
from 380 KB/s to 482 KB/s for writes, and RAID level 5
throughput increased from 729 KB/s to 896 KB/s
for reads and from 375 KB/s to 559 KB/s for writes. Our
design has also attained an increase of 28% over the reported
throughput of the original non-redundant Swift prototype
[4, 9] (about 700 KB/s versus 927 KB/s for the new design)
for read operations (both RAID levels 4 and 5). For write
operations it has achieved about one half of the original
prototype's throughput for RAID level 4
(about 900 KB/s versus 482 KB/s for the new design) and
60% of that throughput for RAID level 5
(about 900 KB/s versus 559 KB/s).


2  Swift  Distributed  File  System  Architecture 

Our goal is to study how the Swift [4, 3] architecture can
make the most effective use of available network capacity.
Swift is designed to support high data rates in a general
purpose distributed system. It is built on the notion of striping
data over multiple storage agents and driving them in parallel.
It assumes that data objects are produced and consumed
by clients and that the objects are managed by the several
components of Swift. In particular, the distribution agents,
storage mediators, and storage agents are involved in planning
and actual data transfer operations between the client and
an array of disks, which are the principal storage media. We
refer the reader to [4, 3] for details of the functionality of
these components of Swift.

The communications protocols used in the original Swift
system operated in a synchronous manner. Packets were
transmitted one at a time, and the sender waited until an
acknowledgment was received from the destination before
sending a new packet. This type of operation amounts to an
automatic repeat request (ARQ) Stop-and-Wait protocol [1].
The protocol is simple and error free, but does not achieve
as high a throughput as is possible with other protocols such
as ARQ Go Back N [1].
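To illustrate why Stop-and-Wait leaves capacity unused, the following sketch applies the standard idealized (error-free) utilization model for window protocols. The function name and the timing values are illustrative assumptions, not measurements from the prototype.

```python
def link_utilization(window: int, t_packet: float, rtt: float) -> float:
    """Idealized fraction of time the sender keeps the link busy.

    window   -- packets the sender may have outstanding (1 = Stop-and-Wait)
    t_packet -- time to transmit one packet
    rtt      -- delay between finishing a packet and receiving its ACK
    Assumes no losses and negligible ACK transmission time.
    """
    cycle = t_packet + rtt          # one send-and-acknowledge cycle
    return min(1.0, window * t_packet / cycle)


# Hypothetical timings (milliseconds), chosen only for illustration.
for w in (1, 2, 3):
    print(w, link_utilization(w, t_packet=6.0, rtt=2.0))
```

With these assumed timings a window of one keeps the link busy 75% of the time, while a window of two is already enough to keep it continuously busy.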

The Swift/RAID prototype was developed to add redundancy
to the original Swift system [9]. The prototype uses a
transfer plan executor to execute a transfer plan set (one
transfer plan set is generated per user request). Execution begins
when the client transfer plan executor has sent the transfer
plan to each server. Each transfer plan executor then steps
through the transfer plan in parallel, executing the individual
instructions. The instruction op-codes available allow
all of the basic file operations and parity calculations. Also
included are synchronization primitives to provide
synchronization between the client and servers. The prototype
implements the RAID-0 (no parity), RAID-4 (fixed parity node)
and RAID-5 (distributed parity node) configurations.

The prototype handles errors in two ways: time-out or
restart. If a node is waiting for an instruction from its peer
and the instruction is not received from the network within
a fixed time, the waiting node issues a restart instruction to
its peer. If a node receives an instruction out of sequence,
a restart is also generated to re-synchronize the instruction
plans. In either case, we assume a packet has been lost on
the network, very likely due to congestion.


3  Improving  Network  Throughput 

Our goal is to maximize the network throughput between
a client making requests to the file system and the
servers in the file system, while also maintaining
good utilization in a heterogeneous environment. To
maximize the performance of the Swift/RAID prototype,
the following three issues need to be addressed: saturation
of client/server network buffers; the expense of time-outs;
and the variability and unpredictability of available network
load capacity.

Burst mode operation can easily flood the network and
overwhelm the Sun implementation of the User Datagram
Protocol (UDP). The Swift/RAID prototype transaction
driver, using a burst mode of operation, can push data to
the UDP layer too quickly, causing data to be dropped and
throughput to approach zero.

An examination of the Sun operating system kernel
source code showed that packets are likely to be lost while
waiting for transmission in the Ethernet queue. This queue
is the final location for a datagram before it is placed onto
the physical network. The queue size is limited to 50 Ethernet
packets.* When the queue exceeds the maximum
allowed, the kernel randomly removes packets from the queue and
drops them without notifying any other part of the system.
There is no obvious way to determine when this happens,
which makes timely congestion detection extremely difficult.
Consider that an 8 KB datagram is fragmented
into five full Ethernet packets and one partial packet. When
the client makes a write request to a three node system of
three or more datagrams per node in one pass, the client kernel
Ethernet queue is immediately inundated with at least 54 packets
(6 x 3 x 3) to deliver. The kernel simply starts dropping packets,
and throughput is seriously diminished.
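The arithmetic behind the 54 packet figure can be checked with a short sketch. The 1480 byte payload per Ethernet packet and the 50 packet queue limit are taken from the text; the function names are hypothetical, and IP/UDP header details are ignored for simplicity.

```python
import math

ETHERNET_PAYLOAD = 1480  # datagram bytes carried per Ethernet packet
QUEUE_LIMIT = 50         # kernel Ethernet queue size, in packets


def fragments(datagram_bytes: int) -> int:
    """Number of Ethernet packets needed to carry one datagram."""
    return math.ceil(datagram_bytes / ETHERNET_PAYLOAD)


def queued_packets(datagram_bytes: int, datagrams_per_node: int, nodes: int) -> int:
    """Ethernet packets handed to the queue in a single pass."""
    return fragments(datagram_bytes) * datagrams_per_node * nodes


demand = queued_packets(8192, datagrams_per_node=3, nodes=3)
print(demand, "packets queued; limit is", QUEUE_LIMIT)
print("overflow" if demand > QUEUE_LIMIT else "fits")
```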


* Increasing the kernel buffers is a temporary fix; it just masks the
underlying problem.


3.1  Flow  Control  Mechanisms 

The underlying system can be thought of as a group of
buffers interconnected by a fixed capacity pipe. The servers
each have a pair of buffers for the client: one for input and
one for output. The client has one pair of buffers for each
server, and these connect to another pair of buffers for the
network. Each of these buffers has a finite capacity, in addition
to the fixed capacity of the pipe. When any of these capacities
is exceeded, packets will be dropped. These buffers can
also be thought of as windows providing an upper bound on
the amount of data a transmitter can expect to send before
receiving an acknowledgment from the destination for more
data.

Because of the buffering needs of the Swift/RAID system,
it lends itself to a class of flow control mechanisms
known as window flow control. Given the communications
primitives available to us at the client/server session level, the
specific class of end-to-end windowing (also known as
entry-to-exit flow control [6]) is well suited to our prototype.
In this scheme, the sender has knowledge of the destination's
buffer capacity and sends packets out only in batches
of sizes up to the available buffer capacity. The destination
sends permits (acknowledgments) for the packets received.
Upon receiving a permit from the destination, the
sender continues by sending another batch of packets. This
method is efficient, and can approach the optimal performance
of a given system [8]. A disadvantage of this scheme is the
difficulty of choosing a window size. The choice of a window size
is a trade-off: small window sizes limit congestion and
tend to avoid large delays, while large window sizes allow
full-speed transmission and maximum throughput under
lightly loaded conditions. One solution to dynamic window
adjustment was suggested by the congestion avoidance
and control mechanisms in 4.3 BSD Reno TCP [7].


4  Implementation  of  the  Adaptive  Prototype 

Our design addresses network buffer saturation, avoidance
of time-outs, and adjustment to variable network capacity.
We chose an ARQ Go Back N protocol [1] as our main mechanism.
To handle the avoidance of time-outs and variable
network capacity, we have added congestion avoidance similar
to that used for TCP [7].

The Swift/RAID architecture implementation is modular,
with all of the communications code placed in one highly
cohesive module, the transaction driver module. All modifications
to accomplish the flow control and congestion avoidance
were applied to this module; all versions of the prototype
remained operational. The original implementations of the
RAID-0, RAID-4 and RAID-5 systems were used, the only
difference being increased throughput.

The transaction driver implementation uses two main
data structures to control its operations. These contain the
set of instructions for the current plan being executed. The
client and each server keep them until the successful
completion of each plan. They also contain information about
the individual communicating entities on both sides: the
client has one structure for each server it is using, and each
server has a structure for each client it is serving.

The data structures were modified to store window
and congestion information for each communications link.
Counters in each structure reflect the last packet each server
has acknowledged, and the last packet acknowledged by
the client to the particular server. At any time, if the
difference between the counters is greater than the allowable
window size, further transmissions are delayed until additional
acknowledgments are received that reflect available
buffer space at the receiver.

A few simple lines of code were added to the instruction
transmission routine to do the difference calculation
and comparison with the available window size. The code
transmits a small burst of packets equal to the available
window size. The same code was added to a similar section of
the instruction execution routine and is executed after the
system receives a new packet.
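The window check described above can be sketched as follows. This is a minimal illustration, not the prototype's code: the `LinkState` class, its field names, and the use of "last sent" and "last acknowledged" sequence counters are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LinkState:
    """Per-link window bookkeeping (hypothetical structure)."""
    window: int          # allowable number of unacknowledged packets
    last_sent: int = 0   # sequence number of the last packet sent
    last_acked: int = 0  # sequence number of the last packet acknowledged

    def can_send(self) -> bool:
        # Difference calculation and comparison with the window size.
        return self.last_sent - self.last_acked < self.window

    def send_burst(self, pending: int) -> list[int]:
        """Send as many pending packets as the window currently allows."""
        sent = []
        while pending > 0 and self.can_send():
            self.last_sent += 1
            sent.append(self.last_sent)
            pending -= 1
        return sent

    def on_ack(self, seq: int) -> None:
        # An acknowledgment frees buffer space at the receiver.
        self.last_acked = max(self.last_acked, seq)


link = LinkState(window=2)
print(link.send_burst(pending=5))  # first burst fills the window
link.on_ack(1)
print(link.send_burst(pending=3))  # one slot was freed by the ACK
```

Calling `send_burst` both from the transmit path and after each received packet mirrors the two places where the paper says the check was added.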

A fast retransmit mechanism was added through a separate
routine in the transaction driver module. We use a
technique similar to that found in TCP for approximating
the variance and determining the new round trip time value.
The calculated round trip time is used by the respective
sender in a session waiting on an acknowledgment from its
peer for the last packet sent. If the sender does not receive an
acknowledgment within the round trip time, packets are resent
from the last acknowledged packet through the current
available window.
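A round trip time estimator in the style used by TCP can be sketched as below. The gains of 1/8 and 1/4, the factor of four on the deviation term, and the initial deviation of half the first sample are the values commonly used for TCP and are assumptions here; the paper does not state the constants used in the prototype.

```python
class RttEstimator:
    """Smoothed round trip time with a mean-deviation estimate."""

    def __init__(self, first_sample: float, gain: float = 0.125,
                 var_gain: float = 0.25) -> None:
        self.srtt = first_sample        # smoothed round trip time
        self.rttvar = first_sample / 2  # smoothed mean deviation (assumed start)
        self.gain = gain
        self.var_gain = var_gain

    def update(self, sample: float) -> float:
        """Fold in a new measurement and return the retransmit time-out."""
        err = sample - self.srtt
        self.srtt += self.gain * err
        self.rttvar += self.var_gain * (abs(err) - self.rttvar)
        return self.timeout()

    def timeout(self) -> float:
        # Time-out = smoothed RTT plus four times the deviation estimate.
        return self.srtt + 4 * self.rttvar


est = RttEstimator(first_sample=8.0)   # hypothetical samples, in ms
for sample in (8.0, 16.0):
    print(sample, est.update(sample))
```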

We deviate significantly from the actual implementation
of the algorithms in TCP. First, we do not allow our congestion
avoidance algorithm to probe past the known window
size. Our Swift/RAID system has one link between the
transmitter and destination nodes (and there are no intermediate
links or gateways to buffer packets while in transit).
Probing past the window size would only cause packets to be
lost and throughput to drop accordingly. Second, the Reno
TCP implementation waits for either a time-out or three
duplicate acknowledgments from the destination before reacting
to the apparently lost packet and triggering its retransmission.
We use the restart mechanism in the transaction
driver to send a restart packet on a time-out or on receipt of an
out-of-order packet (indicating a dropped packet), and we trigger
retransmission on the first receipt of this restart packet.
Finally, in calculating the round trip times we use a much finer
timer granularity than the actual TCP implementation. The
4.3 BSD Reno TCP uses a coarse grained timer of around
500 ms. Our implementation reads the system clock as each
packet is sent from the transaction driver, and again when
each acknowledgment is received, using the difference as
our measured round trip time. We then use a round trip
time estimator to compute our retransmit timer. This gives
us a more accurate time-out calculation for retransmissions.
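The first two deviations can be illustrated with a small Go Back N sender whose congestion window never grows past a known maximum and which rewinds on the first restart. This is a sketch under stated assumptions: the class and method names are hypothetical, and growing the window by one per acknowledgment and halving it on a restart are TCP-like choices that the paper does not spell out.

```python
class GoBackNSender:
    """Go Back N sender with a capped congestion window (illustrative)."""

    def __init__(self, max_window: int) -> None:
        self.max_window = max_window  # known limit; never probed past
        self.cwnd = 1                 # current congestion window
        self.base = 0                 # last acknowledged sequence number
        self.next_seq = 1             # next sequence number to transmit

    def sendable(self) -> list[int]:
        """Sequence numbers that may be transmitted right now."""
        limit = self.base + self.cwnd
        seqs = list(range(self.next_seq, limit + 1))
        self.next_seq = max(self.next_seq, limit + 1)
        return seqs

    def on_ack(self, seq: int) -> None:
        if seq > self.base:
            self.base = seq
            # Grow, but never beyond the known window size.
            self.cwnd = min(self.cwnd + 1, self.max_window)

    def on_restart(self) -> None:
        """React to the first restart: shrink and go back to the last ACK."""
        self.cwnd = max(1, self.cwnd // 2)
        self.next_seq = self.base + 1


s = GoBackNSender(max_window=3)
print(s.sendable())   # initial burst
s.on_ack(1)
print(s.sendable())   # window has grown by one
s.on_restart()
print(s.sendable())   # resends from the packet after the last ACK
```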


5  Results 

Experiments were performed for all available versions of
the system, including RAID-0, RAID-4 and RAID-5. Read
and write throughput was measured for file transfers up
to one megabyte and compared with the original
non-redundant prototype as well as the Swift/RAID prototype.
The throughput measurements performed to evaluate the
prototype were essentially the same as those reported in [9].
In fact, the identical test programs and RAID-4 and RAID-5
modules were used.

5.1  Methodology 

The block size was changed from 8192 bytes (in the
Swift/RAID prototype) to 7340 bytes to avoid internal
fragmentation of the Ethernet packets. Because we were interested
in network performance, the files were preallocated on the
servers so that file creation was not reflected in the results.
Each experiment was repeated 50 times, and the results were
averaged to obtain the values reported here.

The Swift/RAID architecture uses a 60 byte header on the
datagram in addition to the data block being transferred.
Therefore a data block size of 7340 bytes was used, which
when added to the 60 byte header gives a 7400 byte datagram
that fragments into exactly five 1480 byte Ethernet packets
with no internal fragmentation. This maximizes use of the
network resource and increases throughput slightly over the
original 8192 byte block size used in previous experiments.
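The block size choice can be verified directly. The sketch below uses the 60 byte header and 1480 byte Ethernet payload given in the text; the function name is hypothetical.

```python
HEADER_BYTES = 60        # Swift/RAID datagram header
ETHERNET_PAYLOAD = 1480  # datagram bytes carried per Ethernet packet


def fragmentation(block_bytes: int) -> tuple[int, int]:
    """Return (full Ethernet packets, bytes left in a partial packet)."""
    datagram = block_bytes + HEADER_BYTES
    return divmod(datagram, ETHERNET_PAYLOAD)


for block in (7340, 8192):
    full, leftover = fragmentation(block)
    print(f"{block} byte block: {full} full packets, {leftover} bytes left over")
```

A 7340 byte block fills five packets exactly, whereas an 8192 byte block needs a sixth, mostly empty packet.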

To establish the credibility of the data and the data gathering
techniques, a typical experimental run was analyzed
to determine the standard deviation and confidence intervals.
The experiment was run for each block size using the
Swift/RAID adaptive prototype running on a three node
RAID-5 system. The 90% confidence intervals ranged from
±0.4% to ±8.11%, with a mean interval of ±1.91% for reads
and ±3.17% for writes. All of the numbers for the adaptive
Swift/RAID prototype in this report are mean values with
similar confidence intervals. Figure 1 shows the throughput
for both reads and writes with error bars for the 90%
confidence intervals.
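One common way to obtain such intervals from repeated runs is sketched below. The paper does not state how its intervals were computed, so the normal approximation (z = 1.645 for 90%) is an assumption, and the sample values are synthetic, not measured data.

```python
import math
import statistics


def confidence_interval_90(samples: list[float]) -> tuple[float, float, float]:
    """Return (mean, half-width, half-width as a percentage of the mean).

    Uses the normal approximation, which is reasonable for about 50
    repetitions; a t-distribution quantile is preferable for small samples.
    """
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    half = 1.645 * sem
    return mean, half, 100 * half / mean


# Synthetic throughput samples in KB/s, for illustration only.
runs = [900.0, 910.0, 920.0, 930.0, 940.0]
mean, half, pct = confidence_interval_90(runs)
print(f"{mean:.1f} KB/s +/- {half:.1f} ({pct:.2f}%)")
```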

The testing platform was a heterogeneous local area network
consisting of Sun SparcStations, including a SparcStation 2,
a SparcStation IPX, a SparcStation IPC and three
SparcStation SLCs. The interconnection medium was a 10
Mb/s Ethernet. The SparcStation 2 was used as the client
in all of the measurements that follow. The balance of the
machines were used as the Swift/RAID servers, with the
SparcStation IPX and SparcStation IPC always included, and
one or more of the SparcStation SLCs added in as necessary
for the experiment. The client machine was also the NFS file
server and gateway for the subnet described. This accentuated
the inability of the workstation to handle the loads
presented, but the workstation was used for historical reasons
(comparison with previous results). In addition, the
network was an active network heavily used by researchers
and with a fluctuating load. These activities and fluctuations
are reflected in the data collected and appear in the
graphs as small bumps and occasionally more severe dips.
To capture this activity and loading in a more direct manner
we used a Network General Distributed Sniffer System to
monitor network utilization.

Figure 1: Swift/RAID-5 prototype performance with 90%
confidence intervals shown (window size of 2; horizontal axis:
file size read in 7340 byte blocks).

5.2  Performance  Evaluation 

The read throughput obtained for the RAID-0 prototype
using the Stop and Wait protocol ranged from 651 to
758 KB/s. Results for the ARQ Go Back N protocol with
a window size of one are very similar to those of Stop
and Wait, as expected (Stop and Wait is the same as ARQ
Go Back 1). When the window is increased to two, reads
improve slightly to 689 to 830 KB/s. As the window size
is increased up to five, read throughput levels off at 910
KB/s. Results for the RAID-4 and RAID-5 prototypes are
similar and push the throughput on the network to 927 KB/s
and 896 KB/s, respectively. (Figures 2, 3 and 4 show read
and write throughput of a typical file transfer (40 blocks,
293.6 KB) for window sizes from one to five.)

For the RAID-0 prototype it can be seen that the addition
of the adaptive ARQ Go Back N control improved read
operations by as much as 40% for a two node system. The
improvement is less dramatic for larger configurations. However,
it should be noted that in all cases the read throughput
is brought up to about 900 KB/s, which is very close to
the usable bandwidth of the network [2], making further
improvements difficult. RAID-4 read operations show an
improvement of 38% over the Stop and Wait protocol (from
671 KB/s to 927 KB/s) and RAID-5 improved up to 23%
(from 729 KB/s to 896 KB/s).

Write operations for the RAID-0 Stop and Wait prototype
ranged from 616 to 766 KB/s. For our new prototype, writes
achieved throughputs from 768 to 800 KB/s. Write operations
continue at about 800 KB/s, except for the case of four
nodes. Here the client is sending based on a window size
of at least three Swift data unit packets to four servers at
once. One Swift data unit is 7400 bytes with header, which
is five Ethernet packets on our network configuration. This,
times three for the window, times four for the servers, equals
60 Ethernet packets. Write throughput starts to drop off as
the number of blocks transmitted by the client reaches or
exceeds the capacity of the kernel buffers. The Sun
implementation of UDP randomly discards packets from its
kernel network buffers when it receives more packets than
it can handle, and our kernel network buffers were set at
50 packets (the original configuration from Sun). When the
client sends the 60 packets to be transmitted, the kernel
throws some of them away, causing the throughput to
drop accordingly. This can be seen in Figure 2.
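The same calculation can be repeated for other configurations to see where the 50 packet kernel limit is exceeded. The constants come from the text; the function names are hypothetical, and the sketch assumes every server's window is filled in a single burst.

```python
PACKETS_PER_UNIT = 5  # Ethernet packets per 7400 byte Swift data unit
QUEUE_LIMIT = 50      # kernel network buffer limit, in packets


def burst_packets(window: int, servers: int) -> int:
    """Ethernet packets queued when every server's window is filled at once."""
    return PACKETS_PER_UNIT * window * servers


def overflows(window: int, servers: int) -> bool:
    return burst_packets(window, servers) > QUEUE_LIMIT


for servers in (2, 3, 4, 5):
    row = [f"w={w}:{burst_packets(w, servers):3d}{'*' if overflows(w, servers) else ' '}"
           for w in range(1, 6)]
    print(f"{servers} servers", " ".join(row))  # '*' marks an overflow
```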

RAID-4 write operations did not improve over the Stop
and Wait prototype except for the case where the window
size is equal to two. In this case the adaptive prototype
maintains a throughput of 482 KB/s compared to the 400
KB/s attained by the Stop and Wait prototype, an improvement
of 20%. Write operations for the RAID-4 redundant
systems suffer from some inherent problems. Parity block
traffic, the extra traffic for small writes and the CPU overhead
used for parity calculations all adversely impact the
RAID-4 system. Parity block traffic is not serious for systems
with many nodes to stripe across, but for systems with
small numbers of nodes (like ours) it is a major contributing
factor in limiting throughput. The impact of the small
write traffic decreases as the number of blocks requested
increases, but is a major contributor to poor performance for
small block requests. Parity calculations have been shown
to be a factor in limiting the ability of the workstations used
to achieve better performance. Experiments have found the
cost of parity to be 200 KB/s in the Swift/RAID Stop and Wait
prototype using a SparcStation 2 [9]. Additionally, the
bottleneck created by the use of a single parity node becomes
an issue for large writes, and tends to keep RAID-4 performance
below others, such as RAID-5. Also, as the number of nodes
and the window size increase, we run into the same buffer
limitations as we did with RAID-0.

Results for RAID-5 operation show up to a 49% increase
in write operation throughput (from 375 KB/s to 559 KB/s).
RAID-5 write operations do best at window sizes of two
and three, and diminish above that, especially for the larger
server populations. (Again, the Sun operating system kernel
buffers are being exhausted and packets are being randomly
dropped from the queue, causing retransmissions and lower
throughput.)

RAID-5 writes do not improve in all cases, but in general
do as well as the Stop and Wait prototype, or better. Like
RAID-4, RAID-5 has some inherent problems. The
parity block traffic, small write traffic and CPU costs for the
parity calculations all take their toll on RAID-5 systems.
The parity block traffic is the overriding factor in limiting
throughput for small numbers of nodes.

Figure 2: RAID-0 adaptive prototype performance.

Figure 3: RAID-4 adaptive prototype performance.

Results from the use of the network analyzer were interesting
in that they show both read and write operations
utilize the network close to maximum capacity. Why then do
writes seem to perform so poorly in our experiments? The
low performance of the writes is directly attributable to the
cost of redundancy: the parity block that must be updated
for each write operation. For a full stripe write, n - 1 blocks
are written to data nodes, and one parity block is written.
The penalty here is that 1/n of the network traffic is being
spent to preserve redundancy. When n is large this cost is
small. However, in our prototype system n is small (i.e., 3, 4
and 5 nodes were used) and therefore the cost is high. In
addition, small writes (writes that are not full stripes)
cost at least an extra two blocks of traffic (reading the old data
block and the old parity block) plus the rewriting
of the parity block. For a three node system the stripe is two
blocks in size, and small writes impact 50% of the block sizes
used in our experiments. Similarly, in a four node system
small writes impact 33% of the block sizes. As the number
of blocks written becomes large, the system writes mostly
full stripes, with one small write at the end if the total size
does not fall on a full stripe boundary. This lessens the overall
effect of the small writes, so the throughput of the system
is controlled by the cost of the single parity block traffic.
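The redundancy cost for the configurations tested can be tabulated with a few lines. The function name is hypothetical; the 1/n fraction and the four block transfers for a small write follow from the description above.

```python
from fractions import Fraction

# A one-block small write moves four blocks over the network:
# read old data, read old parity, write new data, write new parity.
SMALL_WRITE_TRANSFERS = 4


def parity_overhead(n: int) -> Fraction:
    """Fraction of full stripe write traffic spent on the parity block."""
    # n - 1 data blocks plus one parity block are written per stripe.
    return Fraction(1, n)


for n in (3, 4, 5):
    share = parity_overhead(n)
    print(f"{n} nodes: {share} of write traffic is parity ({float(share):.0%})")
print("useful share of a small write:", Fraction(1, SMALL_WRITE_TRANSFERS))
```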


5.3  Degraded  Mode  Performance  Evaluation:  RAID-4  and  RAID-5

For these experiments the Swift/RAID adaptive prototype
was operated with a window size of two. An operating
node was terminated before each experiment was started.
The RAID-4 system was tested both with the terminated
node as the parity node and with it as one of the data nodes.
RAID-5 was tested with a random node failure. Figure 5 shows the
results on both RAID-4 and RAID-5 for our prototype in
degraded mode operation.

Figure 4: RAID-5 adaptive prototype performance.

In degraded mode operation, RAID-4 and RAID-5 systems
must reconstruct data on the failed node from the remaining
nodes in the system. RAID-4 has two cases to consider:
the case of a failed data node, and the case of a failed parity
node. It can be seen that with a failed data node the read
operation throughput drops from 927 KB/s to 760 KB/s, or
18%. For write operations the throughput increases by 26%
from 482 KB/s to 608 KB/s. For the case where a RAID-4
parity node fails the read operations are unaffected, while
write operations improve by the same percentage over normal
operation as for a data node failure. The improvement in
the write performance is due to a significant decrease in
network traffic due to lost redundancy. When a data node
has failed, only the parity block is written; the write to the
failed node is deferred until the failed node is reconstructed.
When the parity node has failed, only the data blocks are
written, and parity calculation and write are deferred until
reconstruction.
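Reconstruction of a failed node's block from the surviving nodes can be illustrated with bytewise XOR parity, the usual RAID parity function. The paper does not show the prototype's parity code, so the function name and the sample blocks are illustrative assumptions.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A three node stripe: two data blocks and one parity block.
d0, d1 = b"swift", b"raid5"
parity = xor_blocks([d0, d1])

# Degraded read: the node holding d0 has failed, so its block is
# rebuilt from the surviving data block and the parity block.
rebuilt = xor_blocks([d1, parity])
print(rebuilt == d0)
```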

For  RAID-5  systems  there  is  only  one  degraded  mode 
since  the  parity  information  is  distributed  evenly  across  all 
nodes.  Read  operations  show  a  decrease  in  throughput  from 
896  KB/s  to  768  KB/s  (a  loss  of  17%).  Write  operations 
improved  from  559  KB/s  to  714  KB/s,  an  increase  of  27%. 
Writes  improve  for  the  same  reasons  as  in  RAID-4,  and  since 
RAID-5  nodes  contain  both  data  and  parity  information, 
both  causes  apply  to  a  failed  RAID-5  node. 

Our results agree with other research that has shown that the
throughput for writes in systems of fewer than four nodes
actually increases during degraded mode for both RAID-4
and RAID-5 systems [10]. Those results show a decrease
in total load due to writes for RAID-5 systems where the
number of disks is less than eight, and a decrease in average
load for fewer than four disks. A similar result can be shown for
RAID-4 systems, with a decrease in total load for fewer than
eight disks, and a decrease in average load for fewer than five
disks in the system.

6  Conclusions 

An adaptive flow control mechanism has been added
to the Swift/RAID distributed file system
prototype. This mechanism has allowed the prototype
to achieve increases of 23% to 38% in read operation
throughput, and up to a 50% increase for write operations,
over the previous Swift/RAID prototype. The adaptive
RAID-4 and RAID-5 prototypes achieve read throughputs
of about 900 KB/s (927 KB/s and 896 KB/s, respectively).

Bertsekas and Gallager [1] have noted that one of the
limitations of end-to-end window flow control is the trade-off
in choosing a window size: small window sizes keep the number of
packets in the subnet low and congestion to a minimum,
but large windows allow higher rates of transmission and
maximum throughput during light traffic conditions. They
have also suggested that the value should be between n and
3n, where n is the path length between the nodes. Our
results agree and have shown that a window size of two
(on our local network with a path length of one) provided
the maximal throughput for write operations, on the order
of a 50% increase over the Stop and Wait protocol. Window
sizes of 1, 3, 4 and 5 showed little or no improvement
for writes, and for a window size of five the results
tended to be below all others because of the swamping of the
client Ethernet queue. This is also supported by Eldridge [5].
Read operations did better with higher window sizes, but
not significantly better (approximately 5% in most cases).

Most of our improvement in throughput was gained by
the simple "self-clocking" [7] of the data packets with the
acknowledgments sent by the destination as packets were
received. This is because once the packet traffic has stabilized,
the acknowledgment packets are being returned at
the rate at which the receiver is pulling packets off of the
network. Likewise, the sender is receiving the acknowledgments
at the same rate and putting new packets into the
network. This has the effect that as the receiver is taking one
packet off, the sender is simultaneously putting another one
into the network.

One way to improve small writes is through a server-server
protocol. It can be shown that a small write can be
accomplished with one data packet sent from the client to
the server node for storage, and one interim parity packet
sent directly to the parity node from the server receiving
the new data packet. To take advantage of this technique,
a protocol for server to server communication needs to be
designed and implemented. This would allow the small
write problem to be reduced from four network instruction
transfers, all of which involve the client node, to two disk
operation instruction packets sent over the network, only
one of which would involve the client.
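The reason one interim parity packet suffices can be seen from the XOR identity behind parity updates. The sketch below assumes XOR parity; the function names and sample blocks are hypothetical, and no such server to server protocol exists in the prototype described here.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def interim_parity(old_data: bytes, new_data: bytes) -> bytes:
    """Computed on the data server, which already holds the old block."""
    return xor(old_data, new_data)


def apply_interim(old_parity: bytes, interim: bytes) -> bytes:
    """Computed on the parity node, which already holds the old parity."""
    return xor(old_parity, interim)


# Three node stripe: the block being rewritten, one untouched block, parity.
old_data, other, new_data = b"AAAA", b"BBBB", b"CCCC"
old_parity = xor(old_data, other)

# Client -> data server: new_data. Data server -> parity node: interim.
new_parity = apply_interim(old_parity, interim_parity(old_data, new_data))
print(new_parity == xor(new_data, other))
```

Only the new data block crosses the client's link; the old data and old parity never leave the servers that store them.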


Figure 5: Comparison of Swift/RAID prototype read and write
performance in degraded mode (window size of 2). Left panel:
RAID-4 with a failed data node. Right panel: RAID-5 with one
failed node.